A noninvasive artificial neural network model to predict IgA nephropathy risk in Chinese population

Renal biopsy is the gold standard for diagnosing immunoglobulin A nephropathy (IgAN) but poses several problems. Thus, we aimed to establish a noninvasive model for predicting the risk probability of IgAN by analyzing routine and serological parameters. A total of 519 biopsy-diagnosed IgAN and 211 non-IgAN patients were recruited retrospectively. Artificial neural networks and logistic modeling were used. The receiver operating characteristic (ROC) curve and performance characteristics were determined to compare the diagnostic value of the two models. The training and validation sets did not differ significantly in any variable. There were 19 significantly different parameters between the IgAN and non-IgAN groups. After multivariable logistic regression analysis, age, serum albumin, serum IgA, serum immunoglobulin G, estimated glomerular filtration rate, serum IgA/C3 ratio, and hematuria were found to be independently associated with the presence of IgAN. A backpropagation network model based on the above parameters was constructed and applied to the validation cohort, revealing a sensitivity of 82.68% and a specificity of 84.78%. The area under the ROC curve for this model was higher than that for the logistic regression model (0.881 vs. 0.839). The artificial neural network model based on routine markers can be a valuable noninvasive tool for predicting IgAN in screening practice.

Algorithm of BP-ANN. We explored the relationship between risk factors and IgAN using a backpropagation artificial neural network (BP-ANN) model. The BP-ANN was composed of three layers: the input layer, hidden layer, and output layer. The input layer of the ANN consisted of the variables showing statistical significance in the logistic regression analysis. The output layer consisted of one neuron representing the presence of IgAN (valued as end = 1 for IgAN, and end = 0 for non-IgAN). The entire group was divided into a training group (70%) and a validation group (30%) using a random number generator. Backpropagation of the error was used to iteratively adjust the network weights until the error criterion was satisfied.
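The procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation (the study used MATLAB 7.4.0), and several details are assumptions because the text does not state them: the hidden-layer size (`n_hidden=5`), the sigmoid activations, the mean-squared-error loss, the learning rate, and the stopping tolerance `tol`. The function names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    """Logistic activation used in both the hidden and output layers (assumed)."""
    return 1.0 / (1.0 + np.exp(-z))


def split_70_30(n_samples, seed=0):
    """Randomly assign sample indices to training (70%) and validation (30%)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    cut = int(round(0.7 * n_samples))
    return order[:cut], order[cut:]


def train_bp_ann(X, y, n_hidden=5, lr=0.5, max_epochs=2000, tol=1e-3, seed=0):
    """Train an input-hidden-output network by backpropagation of the error.

    Returns the fitted weights and the per-epoch mean squared error.
    Training stops when the error falls below `tol` or after `max_epochs`.
    """
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.5, size=(X.shape[1], n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.5, size=(n_hidden, 1))
    b2 = np.zeros(1)
    target = np.asarray(y, dtype=float).reshape(-1, 1)
    history = []
    for _ in range(max_epochs):
        # Forward pass: input -> hidden -> single output neuron.
        hidden = sigmoid(X @ W1 + b1)
        output = sigmoid(hidden @ W2 + b2)
        error = output - target
        mse = float(np.mean(error ** 2))
        history.append(mse)
        if mse < tol:
            break
        # Backward pass: propagate the error gradient through both layers.
        delta_out = 2.0 * error * output * (1.0 - output) / len(X)
        delta_hid = (delta_out @ W2.T) * hidden * (1.0 - hidden)
        # Gradient-descent weight updates.
        W2 -= lr * (hidden.T @ delta_out)
        b2 -= lr * delta_out.sum(axis=0)
        W1 -= lr * (X.T @ delta_hid)
        b1 -= lr * delta_hid.sum(axis=0)
    return (W1, b1, W2, b2), history


def predict_proba(params, X):
    """Output-neuron activation, read as the predicted probability of end = 1."""
    W1, b1, W2, b2 = params
    return sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).ravel()
```

In use, the predictors would be standardized, the indices from `split_70_30` would select the training rows passed to `train_bp_ann`, and `predict_proba` would then be applied to the held-out validation rows.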
Statistical analysis. Statistical analysis was performed using SPSS version 19.0. Normally distributed data were expressed as x ± s (mean ± SD) and compared using the unpaired Student's t-test. Non-normally distributed data were expressed as medians with their corresponding interquartile ranges and compared using the Mann-Whitney U-test. Categorical variables were expressed as proportions (percentages) and compared using Chi-square tests. A value of P < 0.05 was considered to indicate a statistically significant difference. Statistically significant indicators from the univariate analysis were used as independent variables in the logistic regression model. Receiver operating characteristic (ROC) curves were then plotted, and the area under the curve (AUC) was calculated. The ANN models were developed using MATLAB 7.4.0. The predictive performance of the models was evaluated based on the AUC, sensitivity, and specificity values.
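As an illustration of the three evaluation metrics named above, the following sketch computes the AUC, sensitivity, and specificity from predicted probabilities. It is a generic example rather than the SPSS or MATLAB procedure used in the study. The AUC is computed as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, and the classification cutoff of 0.5 is an assumption, since the text does not report the cutoff used.

```python
import numpy as np


def roc_auc(y_true, scores):
    """AUC as P(score of a positive > score of a negative); ties count as 0.5."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true]
    neg = scores[~y_true]
    diff = pos[:, None] - neg[None, :]  # every positive/negative pair
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))


def sensitivity_specificity(y_true, scores, threshold=0.5):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    y_true = np.asarray(y_true).astype(bool)
    predicted = np.asarray(scores, dtype=float) >= threshold
    tp = np.sum(predicted & y_true)
    fn = np.sum(~predicted & y_true)
    tn = np.sum(~predicted & ~y_true)
    fp = np.sum(predicted & ~y_true)
    return float(tp / (tp + fn)), float(tn / (tn + fp))


# Toy example with four hypothetical patients (1 = IgAN, 0 = non-IgAN).
labels = [1, 1, 0, 0]
probabilities = [0.9, 0.4, 0.6, 0.1]
print(roc_auc(labels, probabilities))                  # 0.75
print(sensitivity_specificity(labels, probabilities))  # (0.5, 0.5)
```

The pairwise AUC formula is equivalent to the Mann-Whitney U statistic divided by the number of positive-negative pairs, which is why it needs no explicit threshold sweep.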
Ethics approval. This study was approved by the ethics committee of the First Hospital of Jilin University, Changchun, China (2021-036).

Consent to participate. Written informed consent was obtained from all participants.
Consent for publication. Consent for publication can be obtained from participants.

Statement of methods. All methods were carried out in accordance with relevant guidelines and regulations.

Results
Clinical characteristics. A total of 730 patients with primary glomerular diseases were enrolled in this study. The pathological types of the non-IgAN cases included membranous nephropathy, mesangial capillary glomerulonephritis, focal segmental glomerulosclerosis, and minimal change disease. The flow diagram of subject screening and grouping is shown in Fig. 1. The training cohort consisted of 511 patients (310 with IgAN and 201 with non-IgAN), of which 45.5% were male, with an average age of 39 years (range, 29.0-51.8 years). The validation set consisted of 219 patients (127 with IgAN and 92 with non-IgAN), of which 47% were male, with an average age of 40 years (range, 30.0-52.0 years). As shown in Table 1, there were no statistically significant differences in any clinical characteristics between the training and validation cohorts.
As presented in Table 2, compared with the non-IgAN group patients in the training cohort, the IgAN group patients were significantly younger on average, had a higher incidence of hematuria and hypertension, and had higher levels of serum albumin, urea nitrogen, creatinine, uric acid, IgA, IgG, and IgA/C3 ratios (P < 0.01). The IgAN group patients also had significantly lower levels of serological IgM, complement C3, total cholesterol, triglycerides, high-density lipoprotein (HDL) cholesterol, low-density lipoprotein (LDL) cholesterol, and hemoglobin, as well as a significantly lower estimated glomerular filtration rate (eGFR) and 24-h urine protein (P < 0.01). Fig. 2A). When applied to the test dataset, the logistic regression model showed an AUC of 0.839, a sensitivity of 81.9%, and a specificity of 83.7% (Fig. 2B).

BP-ANN model prediction of IgAN. A BP-ANN model was constructed using the training data. Based on the multivariable logistic regression results, seven significant factors were chosen as independent variables. The structure of the BP-ANN model and the network training process are shown in Fig. 3. The ROC curve was then obtained (Fig. 4A), and the BP-ANN model was found to provide good predictive performance, with an AUC, sensitivity, and specificity of 0.965, 84.78%, and 94.53%, respectively. The predictive efficacy of the model was further evaluated using the validation set. In the validation cohort, the AUC, sensitivity, and specificity of the model were 0.881, 82.68%, and 84.78%, respectively (Fig. 4B).

Comparison of the BP-ANN and logistic regression models. The evaluation indexes of the BP-ANN and the logistic regression models were compared. AUC values were obtained from the logistic regression and BP-ANN models using the validation set for IgAN prediction. The AUC value of the BP-ANN model was 0.881, which was higher than that of the logistic regression model, indicating the superior performance of the constructed neural network in IgAN prediction.
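To put the reported validation percentages on a concrete scale, they can be converted back into patient counts using the validation group sizes given in the Results (127 IgAN and 92 non-IgAN patients). The counts below are back-calculated by rounding and are not reported in the study; the helper name `rate_to_count` is hypothetical.

```python
def rate_to_count(rate_percent, group_size):
    """Nearest whole number of patients corresponding to a reported percentage."""
    return round(rate_percent / 100 * group_size)


N_IGAN, N_NON_IGAN = 127, 92  # validation cohort sizes from the Results

# Reported validation metrics: (sensitivity %, specificity %).
reported = {
    "BP-ANN": (82.68, 84.78),
    "logistic regression": (81.9, 83.7),
}

for model, (sensitivity, specificity) in reported.items():
    true_pos = rate_to_count(sensitivity, N_IGAN)      # IgAN correctly flagged
    true_neg = rate_to_count(specificity, N_NON_IGAN)  # non-IgAN correctly excluded
    print(f"{model}: {true_pos}/{N_IGAN} IgAN, {true_neg}/{N_NON_IGAN} non-IgAN")

# BP-ANN: 105/127 IgAN, 78/92 non-IgAN
# logistic regression: 104/127 IgAN, 77/92 non-IgAN
```

Under this back-calculation, the two models differ by one patient in each group at their reported operating points, so the difference between them on the validation set is carried mainly by the AUC (0.881 vs. 0.839) rather than by sensitivity or specificity.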

Discussion
The clinical manifestations of IgAN vary from asymptomatic hematuria or proteinuria in the early stages to rapid-onset ESRD in the late stages. IgAN is generally immune-mediated by increased aberrantly glycosylated IgA1 and subsequent complement C3 deposits in the glomerular mesangium 13. Although renal biopsy is the gold standard for diagnosing IgAN, its clinical application is limited in less-developed areas in China and by some patients' insufficient awareness of its necessity. Galactose-deficient IgA1 and a peptide mass fingerprint have been reported as new specific indicators for the diagnosis of IgAN 14. However, they are expensive to detect, and their technical requirements are so high that they are difficult to apply in clinical practice. Therefore, exploring the clinical and laboratory indicators related to IgAN and constructing noninvasive prediction models to screen high-risk patients are of great significance. Through a retrospective cohort study, we identified the risk factors related to IgAN, built the diagnostic models, and evaluated the predictive ability of different modeling algorithms. In this study, IgAN was found to usually occur in young and middle-aged people, and compared with non-IgAN patients, IgAN patients were more likely to have hematuria, proteinuria, and hypertension. Most patients had elevated serum immunoglobulins (especially IgA), decreased complement C3, and renal injury, which are well-known features of IgAN 15. Most scholars now agree that the serum IgA/C3 ratio is more valuable than serum IgA and C3 for the diagnosis and monitoring of IgAN [16][17][18]. Therefore, the serum IgA/C3 ratio was included in this study. In addition, serum IgG levels in the IgAN group were significantly higher than those in the non-IgAN group, which is consistent with the findings of previous literature 19.
In the training dataset, 70% of patients in the non-IgAN group had nephrotic syndrome, while only 22% of patients in the IgAN group had this syndrome. Because of the low proportion of nephrotic syndrome patients with IgAN, it is speculated that indicators related to nephrotic syndrome may be helpful for the differential diagnosis of IgAN and non-IgAN 20, which also explains the high lipid and proteinuria levels and low serum albumin levels in the non-IgAN group. Logistic regression analysis was used to control for confounding factors, and seven variables, namely age, serum IgA/C3 ratio, serum albumin, serum IgA, serum IgG, eGFR, and the presence of hematuria, were found to be independent predictors of IgAN. Of these, the finding that the serum IgA/C3 ratio can help diagnose IgAN is in line with previous related studies 21,22. Originally, Maeda reported that the serum IgA/C3 ratio, combined with microscopic hematuria and/or proteinuria and high serum IgA levels, can be used to distinguish IgAN from other primary renal diseases 23. In 2012, Gao's team used logistic regression analysis for the differential diagnosis of IgAN and non-IgAN by incorporating three factors (serum IgA, fibrinogen, and clinical presentation), with an AUC of 0.838 24. However, the sample size of their study was small, and therefore, the reliability of the results obtained needs further verification. Later, Han QX incorporated age, serum IgA, total cholesterol, D-dimer, and fibrinogen into a logistic regression model for the noninvasive differential diagnosis of IgAN. However, the model was not validated, and therefore, its accuracy remains unverified 25. In contrast, our study enrolled a large number of patients for the combined diagnosis of IgAN through the multiple predictors mentioned above, with an AUC of 0.92, sensitivity of 84.1%, and specificity of 91.4%.
We tested the model on the validation set and obtained AUC, sensitivity, and specificity of 0.839 (more than 0.7), 81.9%, and 83.7% (more than 70%), respectively. Our results indicated that the multifactor-based logistic regression model can effectively predict the risk of IgAN.
However, logistic regression models cannot handle complex nonlinear relationships between inputs and outputs, nor can they detect all possible interactions between predictors. A logistic regression model can only work if the states of all the variables are known, which is often difficult to achieve in clinical practice. In contrast, ANNs have strong nonlinear mapping capability and can handle the complex intrinsic relationships between the missing data and variables. Furthermore, ANN models have been successfully used for prediction and classification in different areas, including informatics and medicine [26][27][28]. In this study, an ANN model for the early screening of IgAN was constructed and validated for the first time based on routine and serum markers. An ROC curve was used to assess the efficacy of the model in predicting the risk of IgAN. The AUC of the validation cohort was very similar to that of the training set, and both were significantly higher than those of the logistic regression model, indicating that the ANN model has better diagnostic performance in differentiating IgAN from non-IgAN. The results showed that the ANN model is more suitable for predicting the risk of IgAN than non-IgAN. Thus, it can be concluded that the ANN model has better clinical usability, as an auxiliary tool for early discovery and timely treatment. This study developed and validated a predictive model for screening the high risk of IgAN with the following advantages: (1) all patients enrolled had primary glomerular diseases confirmed by renal biopsy; (2) combining serum IgA/C3 ratio with age, serum albumin, total cholesterol, and hematuria to establish a predictive model reduced the limitations of using only the serum IgA/C3 ratio as the differential indicator; (3) with the same modeling variables, a simple, safe, and accurate predictive model for IgAN was developed that has good prospects for clinical application.
However, we have to point out some limitations: (1) there could have been selection bias and information bias owing to the retrospective nature of the study design; (2) the small size of the cohort could have influenced the model performance to some degree, and our research objective would be better addressed using a larger validation cohort in a multicenter study; (3) the model cannot determine the grade of IgAN, which has a certain impact on diagnosis and treatment; and (4) the model has not been validated in an external independent cohort.
In conclusion, the established multifactor diagnostic model could effectively distinguish IgAN patients from non-IgAN patients with good specificity. The ANN noninvasive diagnostic model can predict IgAN better than logistic regression and may have good clinical applicability. This model can be helpful for early detection of high-risk IgAN patients, especially in less-developed regions.

Data availability
The datasets used and analysed during the current study are available from the corresponding author on reasonable request.